ABSTRACT

For a learning task, data can usually be collected from different
sources or be represented from multiple views. For example,
laboratory results from different medical examinations are
available for disease diagnosis, and each of them can only
reflect the health state of a person from a particular
aspect/view. Therefore, different views provide complementary
information for learning tasks. An effective integration of the
multi-view information is expected to facilitate the learning
performance. In this paper, we propose a general predictor,
named multi-view machines (MVMs), that can effectively include
all the possible interactions between features from multiple
views. A joint factorization is embedded for the full-order
interaction parameters which allows parameter estimation under
sparsity. Moreover, MVMs can work in conjunction with different
loss functions for a variety of machine learning tasks. A
stochastic gradient descent method is presented to learn the MVM
model. We further illustrate the advantages of MVMs through
comparison with other methods for multi-view classification,
including support vector machines (SVMs), support tensor
machines (STMs) and factorization machines (FMs).
1. INTRODUCTION

In the era of big data, information is available not only in
great volume but also in multiple representations/views from a
variety of sources or feature subsets. Generally, different
views provide complementary information for learning tasks.
Thus, multi-view learning can facilitate the learning process
and is prevalent in a wide range of application domains. For
example, to make an accurate disease diagnosis, one should
consider laboratory results from different medical examinations,
including clinical, imaging, immunologic, serologic and
cognitive measures. For business on the web, it is critical to
estimate the probability that displaying an ad to a specific
user who searches for a query will lead to a click. This process
involves three entities: users, ads, and queries. An effective
integration of the features describing these different entities
is directly related to a precise targeting of the advertising
system.

One of the key challenges of multi-view learning is to model the
interactions between different views, wherein complementary
information is contained. Conventionally, multiple kernel
learning algorithms combine kernels that naturally correspond to
different views to improve the learning performance. Basically,
the kernel coefficients are learned based on the
usefulness/informativeness of the associated views, and thus the
correlations are considered at the view level. These approaches,
however, fail to explicitly explore the correlations between
features. In contrast to modeling on views, another direction
for modeling multi-view data is to directly consider the
abundant correlations between features from different views.

In this paper, we propose a novel model for multi-view 
learning, called multi-view machines (MVMs). The main 
advantages of MVMs are outlined as follows: 

• MVMs include all the possible interactions between features
from multiple views, ranging from the first-order interactions
(i.e., contributions of single features) to the highest order
interactions (i.e., contributions of combinations of features
from each view).

• MVMs jointly factorize the interaction parameters in different
orders to allow parameter estimation under sparsity.

• MVMs are a general predictor that can work with different loss
functions (e.g., square error, hinge loss, logit loss) for a
variety of machine learning tasks.

2. MULTI-VIEW CLASSIFICATION 

We first state the problem of multi-view classification and
introduce the notation. Table 1 lists some basic symbols that
will be used throughout the paper.


Table 1: Symbols.

Symbol                              Definition and Description
$s$                                 each lowercase letter represents a scalar
$\mathbf{v}$                        each boldface lowercase letter represents a vector
$\mathbf{M}$                        each boldface capital letter represents a matrix
$\mathcal{T}$                       each calligraphic letter represents a tensor, set or space
$\langle \cdot, \cdot \rangle$      denotes inner product
$\circ$                             denotes tensor product or outer product
$\times_k$                          denotes mode-k product
$|\cdot|$                           denotes absolute value
$\|\cdot\|_F$                       denotes (Frobenius) norm of vector, matrix or tensor



Figure 1: CP factorization. The third-order ($m = 3$) tensor
$\mathcal{W}$ is approximated by $k$ rank-one tensors. The
$f$-th factor tensor is the tensor product of three vectors,
i.e., $\mathbf{a}_f^{(1)} \circ \mathbf{a}_f^{(2)} \circ \mathbf{a}_f^{(3)}$.


Suppose each instance has representations in $m$ different
views, i.e., $\mathbf{x}^T = (\mathbf{x}^{(1)T}, \ldots, \mathbf{x}^{(m)T})$,
where $\mathbf{x}^{(v)} \in \mathbb{R}^{I_v}$ and $I_v$ is the
dimensionality of the $v$-th view. Let $d = \sum_{v=1}^{m} I_v$,
so $\mathbf{x} \in \mathbb{R}^{d}$. Considering the problem of
click through rate (CTR) prediction for advertising display, for
example, an instance corresponds to an impression which involves
a user, an ad, and a query. Therefore, if
$\mathbf{x}^T = (\mathbf{x}^{(1)T}, \mathbf{x}^{(2)T}, \mathbf{x}^{(3)T})$
is an impression, $\mathbf{x}^{(1)}$ contains information of the
user profile, $\mathbf{x}^{(2)}$ is associated with the ad
information, and $\mathbf{x}^{(3)}$ is the description from the
query aspect. The result of an impression is click or non-click.

We are given a training set with $n$ labeled instances
represented from $m$ views:
$\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$, in
which $\mathbf{x}_i^T = (\mathbf{x}_i^{(1)T}, \ldots, \mathbf{x}_i^{(m)T})$
and $y_i \in \{-1, 1\}$ is the class label of the $i$-th
instance. For the CTR prediction problem, $y = 1$ denotes click
and $y = -1$ denotes non-click in an impression. The task of
multi-view classification is to learn a function
$f: \mathbb{R}^{I_1} \times \cdots \times \mathbb{R}^{I_m} \to \{-1, 1\}$
that correctly predicts the label of a test instance.

In addition, we introduce the concept of tensors which are
higher order arrays that generalize the notions of vectors
(first-order tensors) and matrices (second-order tensors), whose
elements are indexed by more than two indexes. We state the
definition of tensor product and mode-k product which will be
used to formulate our proposed model.


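Definition 2.1 (Tensor Product or Outer Product). The tensor
product $\mathcal{X} \circ \mathcal{Y}$ of a tensor
$\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_m}$ and
another tensor
$\mathcal{Y} \in \mathbb{R}^{I'_1 \times \cdots \times I'_{m'}}$
is defined by

$$(\mathcal{X} \circ \mathcal{Y})_{i_1, \ldots, i_m, i'_1, \ldots, i'_{m'}} = x_{i_1, \ldots, i_m} \, y_{i'_1, \ldots, i'_{m'}} \tag{1}$$

for all index values.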
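As an illustration of Definition 2.1, the following minimal sketch builds a rank-one third-order tensor as the outer product of three vectors. The dictionary-based tensor representation and the helper names `outer` and `vector` are assumptions made for this example only, and indices are 0-based.

```python
from itertools import product

def vector(values):
    """Wrap a list as a first-order tensor keyed by 1-tuples (0-based indices)."""
    return {(i,): v for i, v in enumerate(values)}

def outer(x, y):
    """Outer product of two tensors stored as {index_tuple: value} dictionaries."""
    return {ix + iy: vx * vy
            for (ix, vx), (iy, vy) in product(x.items(), y.items())}

a = vector([1.0, 2.0])
b = vector([3.0, 4.0, 5.0])
c = vector([6.0, 7.0])

# Rank-one third-order tensor of shape 2 x 3 x 2.
t = outer(outer(a, b), c)

print(len(t))        # number of entries, 2 * 3 * 2
print(t[(1, 2, 0)])  # equals a[1] * b[2] * c[0]
```

Each entry of the result is simply the product of one entry from each vector, which is the structure of the factor tensors in a CP factorization.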


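Definition 2.2 (Mode-k Product). The mode-$k$ product
$\mathcal{X} \times_k \mathbf{M}$ of a tensor
$\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_m}$ and a
matrix $\mathbf{M} \in \mathbb{R}^{I'_k \times I_k}$ is defined by

$$(\mathcal{X} \times_k \mathbf{M})_{i_1, \ldots, i_{k-1}, j, i_{k+1}, \ldots, i_m} = \sum_{i_k=1}^{I_k} x_{i_1, \ldots, i_m} \, m_{j, i_k} \tag{2}$$

for all index values.


3. MULTI-VIEW MACHINE MODEL

3.1 Model Formulation

The key challenge of multi-view classification is to model the
interactions between features from different views, wherein
complementary information is contained. Here, we consider
nesting all interactions up to the $m$th order between $m$ views:

$$\hat{y} = \beta_0 + \underbrace{\sum_{v=1}^{m} \sum_{i_v=1}^{I_v} \beta^{(v)}_{i_v} x^{(v)}_{i_v}}_{\text{first-order interactions}} + \cdots + \underbrace{\sum_{i_1=1}^{I_1} \cdots \sum_{i_m=1}^{I_m} \beta_{i_1, \ldots, i_m} \prod_{v=1}^{m} x^{(v)}_{i_v}}_{m\text{th-order interactions}} \tag{3}$$

Let us add an extra feature with constant value 1 to the feature
vector $\mathbf{x}^{(v)}$, i.e.,
$\mathbf{z}^{(v)T} = (\mathbf{x}^{(v)T}, 1) \in \mathbb{R}^{I_v+1}$,
$v = 1, \ldots, m$. Then, Eq. (3) can be effectively rewritten as:

$$\hat{y} = \sum_{i_1=1}^{I_1+1} \cdots \sum_{i_m=1}^{I_m+1} w_{i_1, \ldots, i_m} \left( \prod_{v=1}^{m} z^{(v)}_{i_v} \right) \tag{4}$$

where $w_{I_1+1, \ldots, I_m+1} = \beta_0$ and
$w_{i_1, \ldots, i_m} = \beta_{i_1, \ldots, i_m}$, $\forall i_v \le I_v$.
For $w_{i_1, \ldots, i_m}$ with some indexes satisfying
$i_v = I_v + 1$, it encodes a lower order interaction between the
views whose $i_v \le I_v$. Hereinafter, let $w^{(p)}_{i_p}$ denote
$w_{i_1, \ldots, i_m}$ where only $i_p \le I_p$ and
$i_v = I_v + 1$, $v \ne p$, and let $w^{(p,q)}_{i_p, i_q}$ denote
$w_{i_1, \ldots, i_m}$ where $i_p \le I_p$, $i_q \le I_q$ and
$i_v = I_v + 1$, $v \notin \{p, q\}$ for the other $m - 2$ views,
etc.

The number of parameters in Eq. (4) is $\prod_{v=1}^{m} (I_v + 1)$,
which can make the model prone to overfitting and ineffective on
sparse data. Therefore, we assume that the effect of
interactions has a low rank and the $m$th-order tensor
$\mathcal{W} = \{w_{i_1, \ldots, i_m}\} \in \mathbb{R}^{(I_1+1) \times \cdots \times (I_m+1)}$
can be factorized into $k$ factors:

$$\mathcal{W} = \mathcal{C} \times_1 \mathbf{A}^{(1)} \times_2 \cdots \times_m \mathbf{A}^{(m)} \tag{5}$$

where $\mathbf{A}^{(v)} \in \mathbb{R}^{(I_v+1) \times k}$, and
$\mathcal{C} \in \mathbb{R}^{k \times \cdots \times k}$ is the
identity tensor, i.e.,
$c_{i_1, \ldots, i_m} = \delta(i_1 = \cdots = i_m)$. Basically, it
is a CANDECOMP/PARAFAC (CP) factorization [2] as shown in
Figure 1, with element-wise notation
$w_{i_1, \ldots, i_m} = \sum_{f=1}^{k} \prod_{v=1}^{m} a^{(v)}_{i_v, f}$.
The number of model parameters is reduced to
$k \sum_{v=1}^{m} (I_v + 1) = k(m + d)$. It transforms Eq. (4) into:

$$\hat{y} = \sum_{i_1=1}^{I_1+1} \cdots \sum_{i_m=1}^{I_m+1} \left( \prod_{v=1}^{m} z^{(v)}_{i_v} \right) \left( \sum_{f=1}^{k} \prod_{v=1}^{m} a^{(v)}_{i_v, f} \right) \tag{6}$$

We name this model multi-view machines (MVMs). As shown in
Figure 2, the full-order interactions between multiple views are
modeled in a single tensor, and they are factorized
collectively. The model parameters that have to be estimated
are:

$$\mathbf{A}^{(v)} \in \mathbb{R}^{(I_v+1) \times k}, \quad v = 1, \ldots, m \tag{7}$$

where the $i_v$-th row
$\mathbf{a}^{(v)}_{i_v} = (a^{(v)}_{i_v, 1}, \ldots, a^{(v)}_{i_v, k})$
within $\mathbf{A}^{(v)}$ describes the $i_v$-th feature in the
$v$-th view with $k$ factors. Let the last row
$\mathbf{a}^{(v)}_{I_v+1}$ denote the bias factor from the $v$-th
view, since it is always combined with $z^{(v)}_{I_v+1} = 1$ in
Eq. (6).

Figure 2: Multi-view machines. All the interactions of different
orders between multiple views are modeled in a single tensor and
share the same set of latent factors.

Hence,

$$w_{I_1+1, \ldots, I_m+1} = \sum_{f=1}^{k} \prod_{v=1}^{m} a^{(v)}_{I_v+1, f} \tag{8}$$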

is the global bias, denoted as $w_0$ hereinafter.

Moreover, MVMs are flexible in the order of interactions of
interest. That is to say, when there are too many views
available for a learning task and interactions between some of
them are physically meaningless, or when very high order
interactions are intuitively uninterpretable, it is not
desirable to include these potentially redundant interactions in
the model. In such scenarios, one can (1) partition the views
into (possibly overlapping) groups, (2) construct multiple MVMs
on these view groups where the full-order interactions within
each group are included, and (3) implement a coupled
matrix/tensor factorization. This implementation excludes the
cross-group interactions. Although MVMs are feasible in any
order of interactions, that is outside the scope of this paper;
our focus is on investigating how to effectively explore the
full-order interactions within a given set of views.

3.2 Time Complexity

Next, we show how to make MVMs applicable from a computational
point of view. The straightforward time complexity of Eq. (6) is
$O(k \prod_{v=1}^{m} (I_v + 1))$. However, we observe that there
is no model parameter which directly depends on the interactions
between variables (e.g., a parameter with an index
$(i_1, \ldots, i_m)$), due to the factorization of the
interactions. Therefore, the time complexity can be largely
reduced.

Lemma 3.1. The model equation of MVMs can be computed in linear
time $O(k(m + d))$.

Proof. The interactions in Eq. (6) can be reformulated as:

$$\hat{y} = \sum_{f=1}^{k} \sum_{i_1=1}^{I_1+1} \cdots \sum_{i_m=1}^{I_m+1} \prod_{v=1}^{m} z^{(v)}_{i_v} a^{(v)}_{i_v, f} = \sum_{f=1}^{k} \prod_{v=1}^{m} \left( \sum_{i_v=1}^{I_v+1} z^{(v)}_{i_v} a^{(v)}_{i_v, f} \right) \tag{9}$$

This equation has only linear complexity in both $k$ and $I_v$.
Thus, its time complexity is $O(k(m + d))$, which is in the same
order as the number of parameters in the model. □
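To make Lemma 3.1 concrete, the following sketch evaluates the model both by enumerating every index tuple, as in Eq. (6), and through the factored form of Eq. (9), and checks that the two agree on random inputs. The list-of-lists layout for the factor matrices and the function names are assumptions of this example, not part of the model definition.

```python
from itertools import product
import random

def add_bias(views):
    """Append the constant feature 1 to every view: x^(v) -> z^(v)."""
    return [list(x) + [1.0] for x in views]

def predict_naive(views, A):
    """Eq. (6): enumerate all index tuples; cost grows with prod_v (I_v + 1)."""
    z = add_bias(views)
    k = len(A[0][0])
    total = 0.0
    for idx in product(*(range(len(zv)) for zv in z)):
        z_prod = 1.0
        for v, i in enumerate(idx):
            z_prod *= z[v][i]
        w = 0.0  # w_{i_1..i_m} = sum_f prod_v a^(v)_{i_v, f}
        for f in range(k):
            term = 1.0
            for v, i in enumerate(idx):
                term *= A[v][i][f]
            w += term
        total += z_prod * w
    return total

def predict_linear(views, A):
    """Eq. (9): sum_f prod_v (sum_i z^(v)_i a^(v)_{i,f}); cost O(k(m + d))."""
    z = add_bias(views)
    k = len(A[0][0])
    total = 0.0
    for f in range(k):
        term = 1.0
        for v, zv in enumerate(z):
            # zero features contribute nothing, so sparse inputs are cheaper
            term *= sum(zi * A[v][i][f] for i, zi in enumerate(zv) if zi != 0.0)
        total += term
    return total

random.seed(0)
dims, k = [3, 2, 4], 2  # I_1, I_2, I_3 and the number of factors
A = [[[random.gauss(0, 1) for _ in range(k)] for _ in range(I + 1)] for I in dims]
x = [[random.gauss(0, 1) for _ in range(I)] for I in dims]

print(abs(predict_naive(x, A) - predict_linear(x, A)) < 1e-9)
```

The naive version is only practical for tiny inputs; it is included here solely as a reference against which the linear-time form can be checked.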


4. LEARNING MULTI-VIEW MACHINES 

To learn the parameters in MVMs, we consider the following
regularization framework:

$$\arg\min_{\Theta} \sum_{(\mathbf{x}, y) \in \mathcal{D}} \mathcal{L}(\hat{y}(\mathbf{x}|\Theta), y) + \lambda \Omega(\Theta) \tag{10}$$

where $\Theta$ represents all the model parameters,
$\mathcal{L}(\cdot)$ is the loss function, $\Omega(\cdot)$ is the
regularization term, and $\lambda$ controls the trade-off between
the empirical loss and the risk of overfitting.

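Importantly, MVMs can be used to perform a variety of machine
learning tasks, depending on the choice of the loss function.
For example, to conduct regression, the square error is a
popular choice:

$$\mathcal{L}^{S}(\hat{y}(\mathbf{x}|\Theta), y) = (\hat{y}(\mathbf{x}|\Theta) - y)^2 \tag{11}$$

and for classification problems, we can use the logit loss:

$$\mathcal{L}^{L}(\hat{y}(\mathbf{x}|\Theta), y) = \log(1 + \exp(-y \cdot \hat{y}(\mathbf{x}|\Theta))) \tag{12}$$

or the hinge loss:

$$\mathcal{L}^{H}(\hat{y}(\mathbf{x}|\Theta), y) = \max(0, 1 - y \cdot \hat{y}(\mathbf{x}|\Theta)) \tag{13}$$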
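The three losses in Eqs. (11)-(13) can be written down directly. The sketch below is illustrative only; the function names are assumptions, and the logit loss is rearranged algebraically so that it does not overflow for large margins.

```python
import math

def square_loss(y_hat, y):
    """Eq. (11): squared error, typically used for regression."""
    return (y_hat - y) ** 2

def logit_loss(y_hat, y):
    """Eq. (12): log(1 + exp(-y * y_hat)), evaluated in an overflow-safe form."""
    margin = y * y_hat
    if margin > 0:
        return math.log1p(math.exp(-margin))
    # log(1 + exp(-m)) = -m + log(1 + exp(m)) for m <= 0
    return -margin + math.log1p(math.exp(margin))

def hinge_loss(y_hat, y):
    """Eq. (13): zero once the margin y * y_hat reaches 1."""
    return max(0.0, 1.0 - y * y_hat)

for y_hat in (-2.0, 0.0, 0.5, 2.0):
    print(y_hat, square_loss(y_hat, 1), logit_loss(y_hat, 1), hinge_loss(y_hat, 1))
```

For a positive label, the hinge loss vanishes for predictions of at least 1, while the logit loss only approaches zero asymptotically.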

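The regularization term is chosen based on our prior knowledge
about the model parameters. Typically, we can apply the L2-norm:

$$\Omega^{L2}(\Theta) = \|\Theta\|_2^2 = \sum_{\theta \in \Theta} \theta^2 \tag{14}$$

or the L1-norm:

$$\Omega^{L1}(\Theta) = \|\Theta\|_1 = \sum_{\theta \in \Theta} |\theta| \approx \sum_{\theta \in \Theta} \sqrt{\theta^2 + \epsilon} \tag{15}$$

where $\epsilon$ is a very small number to make the L1-norm term
differentiable.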

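The model parameters
$\Theta = \{\mathbf{A}^{(v)} \mid v = 1, \ldots, m\}$ can be
learned efficiently by alternating least squares (ALS),
stochastic gradient descent (SGD), L-BFGS, etc., for a variety
of loss functions, including square error, hinge loss, logit
loss, etc. From Eq. (9), the gradient of the MVM model is:

$$\frac{\partial \hat{y}(\mathbf{x}|\Theta)}{\partial \theta} = z^{(v)}_{i_v} \prod_{v' \ne v} \left( \sum_{i_{v'}=1}^{I_{v'}+1} z^{(v')}_{i_{v'}} a^{(v')}_{i_{v'}, f} \right) \tag{16}$$

where $\theta = a^{(v)}_{i_v, f}$, and $z^{(v)}_{i_v} = 1$ if
$i_v = I_v + 1$, otherwise $z^{(v)}_{i_v} = x^{(v)}_{i_v}$. It
validates that MVMs possess the multilinearity property, because
the gradient along $\theta$ is independent of the value of
$\theta$ itself.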

Note that in Eq. (16), the sum
$\sum_{i_{v'}=1}^{I_{v'}+1} z^{(v')}_{i_{v'}} a^{(v')}_{i_{v'}, f}$
can be precomputed and reused for updating the $f$-th factor of
all the features. Hence, each gradient can be computed in
$O(m)$. In an iteration, including the precomputation time, all
the $k(m + d)$ parameters can be updated in $O(mk(m + d))$. The
cost is reduced further under sparsity, where most of the
elements in $\mathbf{x}$ (or $\mathbf{z}$) are 0 and thus the
sums only have to be computed over the non-zero elements.
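The precomputation described above can be sketched as follows. The per-view sums are computed once per instance and reused for every partial derivative, and the result is compared with a central finite difference. Because the model is linear in any single parameter, the two should agree up to rounding. All names and the data layout are assumptions of this example.

```python
import random

def view_sums(z, A, k):
    """s[v][f] = sum_i z^(v)_i * a^(v)_{i,f}, computed once per instance."""
    return [[sum(zi * A[v][i][f] for i, zi in enumerate(zv)) for f in range(k)]
            for v, zv in enumerate(z)]

def predict(z, A, k):
    """Eq. (9) expressed through the per-view sums."""
    s = view_sums(z, A, k)
    total = 0.0
    for f in range(k):
        term = 1.0
        for v in range(len(z)):
            term *= s[v][f]
        total += term
    return total

def grad(z, s, v, i, f):
    """Eq. (16): d y_hat / d a^(v)_{i,f} = z^(v)_i * prod_{v' != v} s[v'][f]."""
    g = z[v][i]
    for u in range(len(z)):
        if u != v:
            g *= s[u][f]
    return g

random.seed(1)
dims, k = [3, 2, 2], 2
A = [[[random.gauss(0, 1) for _ in range(k)] for _ in range(I + 1)] for I in dims]
# z^(v): features of view v followed by the constant bias feature 1
z = [[random.gauss(0, 1) for _ in range(I)] + [1.0] for I in dims]

s = view_sums(z, A, k)
v, i, f, h = 1, 0, 1, 1e-4
analytic = grad(z, s, v, i, f)

# Central finite difference on the same parameter.
A[v][i][f] += h
up = predict(z, A, k)
A[v][i][f] -= 2 * h
down = predict(z, A, k)
A[v][i][f] += h  # restore the original value
numeric = (up - down) / (2 * h)

print(abs(analytic - numeric) < 1e-6)
```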

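It is straightforward to embed Eq. (16) into the gradient of the
loss functions, e.g., Eqs. (11)-(13), for direct optimization, as
follows:

$$\frac{\partial \mathcal{L}^{S}(\hat{y}(\mathbf{x}|\Theta), y)}{\partial \theta} = 2(\hat{y}(\mathbf{x}|\Theta) - y) \frac{\partial \hat{y}(\mathbf{x}|\Theta)}{\partial \theta} \tag{17}$$

$$\frac{\partial \mathcal{L}^{L}(\hat{y}(\mathbf{x}|\Theta), y)}{\partial \theta} = \frac{-y \exp(-y \cdot \hat{y}(\mathbf{x}|\Theta))}{1 + \exp(-y \cdot \hat{y}(\mathbf{x}|\Theta))} \frac{\partial \hat{y}(\mathbf{x}|\Theta)}{\partial \theta} \tag{18}$$

$$\frac{\partial \mathcal{L}^{H}(\hat{y}(\mathbf{x}|\Theta), y)}{\partial \theta} = \begin{cases} -y \dfrac{\partial \hat{y}(\mathbf{x}|\Theta)}{\partial \theta} & \text{if } y \cdot \hat{y}(\mathbf{x}|\Theta) < 1 \\ 0 & \text{otherwise} \end{cases} \tag{19}$$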
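Each loss gradient is a scalar factor that depends only on the prediction and the label, multiplied by the model gradient. A minimal sketch of that chain-rule structure is given below, with a finite-difference check for the logit case. The function names are assumptions of this example.

```python
import math

def dsquare(y_hat, y):
    """Scalar factor of Eq. (17)."""
    return 2.0 * (y_hat - y)

def dlogit(y_hat, y):
    """Scalar factor of Eq. (18), arranged so that exp() never overflows."""
    margin = y * y_hat
    if margin >= 0:
        e = math.exp(-margin)
        return -y * e / (1.0 + e)
    return -y / (1.0 + math.exp(margin))

def dhinge(y_hat, y):
    """Scalar factor of Eq. (19); a subgradient at the kink y * y_hat = 1."""
    return -y if y * y_hat < 1.0 else 0.0

def loss_gradient(dloss, y_hat, y, dyhat_dtheta):
    """Chain rule: dL/dtheta = (dL/dy_hat) * (dy_hat/dtheta)."""
    return dloss(y_hat, y) * dyhat_dtheta

# Finite-difference check of the logit factor at an arbitrary point.
def logit(y_hat, y):
    return math.log1p(math.exp(-y * y_hat))

y_hat, y, h = 0.3, -1.0, 1e-6
numeric = (logit(y_hat + h, y) - logit(y_hat - h, y)) / (2 * h)
print(abs(dlogit(y_hat, y) - numeric) < 1e-6)
```

Here `dyhat_dtheta` stands for the value given by Eq. (16) for the parameter being updated.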

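Moreover, the gradient of the regularization term
$\Omega(\Theta)$ can be derived:

$$\frac{\partial \Omega^{L2}(\Theta)}{\partial \theta} = 2\theta \tag{20}$$

$$\frac{\partial \Omega^{L1}(\Theta)}{\partial \theta} = \frac{\theta}{\sqrt{\theta^2 + \epsilon}} \tag{21}$$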


The SGD optimization method for MVMs is summarized in
Algorithm 1, where the model parameters are first initialized
from a zero-mean normal distribution with standard deviation
$\sigma$, and the gradients in line 8 can be computed according
to Eqs. (17)-(19) and Eqs. (20)-(21). Moreover, rather than
specifying a learning rate $\eta$ beforehand, we can use a line
search to determine it in the optimization process. The
regularization parameter $\lambda$ can be searched on a held-out
validation set. Considering the number of factors $k$, the
performance can usually be improved with larger $k$, at the cost
of more parameters, which can make the learning much harder in
terms of both runtime and memory.

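Algorithm 1 Stochastic Gradient Descent for MVMs
Input: Training data $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$, number of factors $k$, regularization parameter $\lambda$, learning rate $\eta$, standard deviation $\sigma$
Output: Model parameters $\Theta = \{\mathbf{A}^{(v)} \in \mathbb{R}^{(I_v+1) \times k} \mid v = 1, \ldots, m\}$
 1: Initialize $\mathbf{A}^{(v)} \sim \mathcal{N}(0, \sigma)$
 2: repeat
 3:   for $(\mathbf{x}, y) \in \mathcal{D}$ do
 4:     for $v := 1$ to $m$ do
 5:       for $i_v := 1$ to $I_v + 1$ do
 6:         if $z^{(v)}_{i_v} \ne 0$ then
 7:           for $f := 1$ to $k$ do
 8:             $\theta \leftarrow \theta - \eta \left( \frac{\partial \mathcal{L}(\hat{y}(\mathbf{x}|\Theta), y)}{\partial \theta} + \lambda \frac{\partial \Omega(\Theta)}{\partial \theta} \right)$, with $\theta = a^{(v)}_{i_v, f}$
 9:           end for
10:         end if
11:       end for
12:     end for
13:   end for
14: until convergence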
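The following is a minimal, unoptimized sketch of Algorithm 1 for the logit loss with L2 regularization. It makes several simplifying assumptions that are not part of the algorithm as stated: a fixed number of epochs replaces the convergence test, the learning rate is constant, and the per-view sums are computed once per instance and reused for all updates on that instance. The toy data set (label equal to the product of two single-feature views) and all names are hypothetical.

```python
import math
import random

def train_mvm(data, dims, k=2, lam=1e-3, eta=0.05, sigma=0.1, epochs=300, seed=0):
    """SGD for MVMs with logit loss and L2 regularization (fixed epoch count)."""
    rng = random.Random(seed)
    data = list(data)  # shuffle a copy, leave the caller's list untouched
    m = len(dims)
    A = [[[rng.gauss(0.0, sigma) for _ in range(k)] for _ in range(I + 1)]
         for I in dims]
    for _ in range(epochs):
        rng.shuffle(data)
        for views, y in data:
            z = [list(x) + [1.0] for x in views]
            # per-view sums, reused for every update on this instance
            s = [[sum(zi * A[v][i][f] for i, zi in enumerate(z[v]))
                  for f in range(k)] for v in range(m)]
            y_hat = sum(math.prod(s[v][f] for v in range(m)) for f in range(k))
            # scalar factor of Eq. (18); the exponent is capped to avoid overflow
            dloss = -y / (1.0 + math.exp(min(y * y_hat, 50.0)))
            for v in range(m):
                for i, zi in enumerate(z[v]):
                    if zi == 0.0:
                        continue  # zero features have zero loss gradient
                    for f in range(k):
                        others = math.prod(s[u][f] for u in range(m) if u != v)
                        g = dloss * zi * others + lam * 2.0 * A[v][i][f]
                        A[v][i][f] -= eta * g
    return A

def predict(views, A):
    """Linear-time model equation, Eq. (9)."""
    z = [list(x) + [1.0] for x in views]
    k = len(A[0][0])
    return sum(math.prod(sum(zi * A[v][i][f] for i, zi in enumerate(zv))
                         for v, zv in enumerate(z)) for f in range(k))

def mean_logit_loss(data, A):
    return sum(math.log1p(math.exp(-y * predict(x, A))) for x, y in data) / len(data)

# Toy problem: two views with one feature each; the label is their product,
# so only a second-order interaction can explain it.
data = [([[a], [b]], a * b) for a in (-1.0, 1.0) for b in (-1.0, 1.0)]
dims = [1, 1]

init = train_mvm(data, dims, epochs=0)  # same seed, so this is the starting point
A = train_mvm(data, dims)
print(mean_logit_loss(data, init), mean_logit_loss(data, A))
```

A production implementation would iterate only over non-zero features of sparse inputs and would use a proper stopping criterion or line search, as discussed in the text.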


5. RELATED WORK

In this section, we discuss and compare our proposed MVM model
with other methods (and extensions) for multi-view
classification, including support vector machines (SVMs),
support tensor machines (STMs) and factorization machines
(FMs).

5.1 SVM Model 

Vapnik introduced support vector machines (SVMs) based on the
maximum-margin hyperplane. Essentially, SVMs integrate the hinge
loss and the L2-norm regularization. The decision function with
a linear kernel is¹:

$$\hat{y} = w_0 + \sum_{i=1}^{d} w_i x_i \tag{22}$$

In the multi-view setting, $\mathbf{x}$ is simply a
concatenation of features from different views, i.e.,
$\mathbf{x}^T = (\mathbf{x}^{(1)T}, \ldots, \mathbf{x}^{(m)T})$,
as shown in Figure 3. Thus, Eq. (22) is equivalent to:

$$\hat{y} = w_0 + \sum_{v=1}^{m} \sum_{i_v=1}^{I_v} w^{(v)}_{i_v} x^{(v)}_{i_v} \tag{23}$$

Figure 3: Related work (and extensions) on modeling the
interactions between multiple views. In general, the linear SVM
model is limited to the first-order interactions; the STM model
explores only the highest order interactions; in spite of
including all the interactions in different orders, the FM model
is not sufficiently factorized compared to our proposed MVM
model.

Obviously, no interactions between views are explored in
Eq. (23). By restricting $i_v = I_v + 1$ for any $m - 1$ indexes
of $w_{i_1, \ldots, i_m}$ in Eq. (4), i.e., removing
factorization and higher order interactions from MVMs, we obtain
the linear SVMs:

$$\hat{y} = w_{I_1+1, \ldots, I_m+1} + \sum_{v=1}^{m} \sum_{i_v=1}^{I_v} w^{(v)}_{i_v} z^{(v)}_{i_v} \tag{24}$$

¹The sign function is omitted, because the analysis and
conclusions can easily extend to other generalized linear
models, e.g., logistic regression.


Through the employment of a nonlinear kernel, SVMs can
implicitly project data from the feature space into a more
complex high-dimensional space, which allows SVMs to model
higher order interactions between features. However, all
interaction parameters of nonlinear SVMs are completely
independent. In contrast, the interaction parameters of MVMs are
collectively factorized and thus dependencies exist when
interactions share the same feature.

For nonlinear SVMs, there must be enough instances
$\mathbf{x} \in \mathcal{D}$ where $x^{(p)}_{i_p} \ne 0$ and
$x^{(q)}_{i_q} \ne 0$ to reliably estimate the second-order
interaction parameter $w^{(p,q)}_{i_p, i_q}$. The instances with
either $x^{(p)}_{i_p} = 0$ or $x^{(q)}_{i_q} = 0$ cannot be used
for estimating $w^{(p,q)}_{i_p, i_q}$. That is to say, on a
sparse dataset where there are too few or even no cases for some
higher order interactions, nonlinear SVMs are likely to
degenerate into linear SVMs.

The factorization of interactions in Eq. (5) benefits MVMs for
parameter estimation under sparsity, since the latent factor
$\mathbf{a}^{(v)}_{i_v}$ can be learned from any instance whose
$x^{(v)}_{i_v} \ne 0$, which allows the second-order interaction
$w^{(p,q)}_{i_p, i_q}$ to be approximated from instances whose
$x^{(p)}_{i_p} \ne 0$ or $x^{(q)}_{i_q} \ne 0$ rather than only
instances whose $x^{(p)}_{i_p} \ne 0$ and $x^{(q)}_{i_q} \ne 0$.
Therefore, the interaction parameters in MVMs can be effectively
learned without direct observations of such interactions in a
training set of sparse data.

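5.2 STM Model

Cao et al. investigated multi-view classification by modeling
interactions between views as a tensor, i.e.,
$\mathcal{X} = \mathbf{x}^{(1)} \circ \cdots \circ \mathbf{x}^{(m)} \in \mathbb{R}^{I_1 \times \cdots \times I_m}$,
and solved the problem in the framework of support tensor
machines (STMs). Basically, as shown in Figure 3, only the
highest order interactions are explored:

$$\hat{y} = \sum_{i_1=1}^{I_1} \cdots \sum_{i_m=1}^{I_m} w_{i_1, \ldots, i_m} \left( \prod_{v=1}^{m} x^{(v)}_{i_v} \right) \tag{25}$$

where $w_{i_1, \ldots, i_m} = \prod_{v=1}^{m} w^{(v)}_{i_v}$ is a
rank-one decomposition of the tensor
$\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_m}$.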

However, estimating a lower order interaction (e.g., a pairwise
one) reliably is easier than estimating a higher order one, and
lower order interactions can usually explain the data
sufficiently. Thus, it is critical to include the lower order
interactions in MVMs. Moreover, instead of a rank-one
decomposition, we apply a higher rank decomposition of
$\mathcal{W} \in \mathbb{R}^{(I_1+1) \times \cdots \times (I_m+1)}$
to capture more latent factors, thereby achieving a better
approximation to the original interaction parameters.
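One consequence of the rank-one structure in Eq. (25), stated here as an observation rather than as part of the original discussion, is that the full sum collapses into a product of per-view inner products. The sketch below checks this on a small hypothetical example; the function names and the numbers are illustrative.

```python
from itertools import product
import math

def stm_naive(x, w):
    """Eq. (25) with w_{i_1..i_m} = prod_v w^(v)_{i_v}, summed over all index tuples."""
    total = 0.0
    for idx in product(*(range(len(xv)) for xv in x)):
        total += math.prod(w[v][i] * x[v][i] for v, i in enumerate(idx))
    return total

def stm_factored(x, w):
    """Equivalent form: product over views of the inner products <w^(v), x^(v)>."""
    return math.prod(sum(wi * xi for wi, xi in zip(wv, xv)) for wv, xv in zip(w, x))

x = [[1.0, 2.0], [0.5, -1.0, 3.0]]   # two views with 2 and 3 features
w = [[0.2, -0.4], [1.0, 0.5, 0.1]]   # one weight vector per view

print(stm_naive(x, w), stm_factored(x, w))
```

Since no constant feature is appended here, a view whose inner product is zero removes the whole prediction, which illustrates why the absence of lower order terms can be limiting.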


5.3 FM Model 

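Rendle introduced factorization machines (FMs) that combine the
advantages of SVMs with factorization models. The model equation
for a second-order FM is as follows:

$$\hat{y} = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j \tag{26}$$

where $d = \sum_{v=1}^{m} I_v$ and
$\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} v_{j,f}$.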

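However, the pairwise interactions between all the features are
included in FMs without consideration of the view segmentation.
In the multi-view setting, there can be redundant correlations
between features within the same view which are thereby
unworthy of consideration. The coupled group lasso model is
essentially an application of the second-order FMs to multi-view
classification.

To exclude the interactions within the same view, we can simply
modify Eq. (26) as:

$$\hat{y} = w_0 + \sum_{v=1}^{m} \sum_{i_v=1}^{I_v} w^{(v)}_{i_v} x^{(v)}_{i_v} + \sum_{p=1}^{m} \sum_{q=p+1}^{m} \sum_{i_p=1}^{I_p} \sum_{i_q=1}^{I_q} \langle \mathbf{v}^{(p)}_{i_p}, \mathbf{v}^{(q)}_{i_q} \rangle x^{(p)}_{i_p} x^{(q)}_{i_q} \tag{27}$$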
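The pairwise interaction parameter
$w^{(p,q)}_{i_p, i_q} = \langle \mathbf{v}^{(p)}_{i_p}, \mathbf{v}^{(q)}_{i_q} \rangle$
in Eq. (27) indicates that $w^{(p,q)}_{i_p, i_q}$ can be learned
from instances whose $x^{(p)}_{i_p} \ne 0$ and some
$x^{(q)}_{j} \ne 0$ (sharing $\mathbf{v}^{(p)}_{i_p}$), or
$x^{(q)}_{i_q} \ne 0$ and some $x^{(p)}_{j} \ne 0$ (sharing
$\mathbf{v}^{(q)}_{i_q}$), which makes FMs more robust under
sparsity than SVMs, where only instances with
$x^{(p)}_{i_p} \ne 0$ and $x^{(q)}_{i_q} \ne 0$ can be used to
learn $w^{(p,q)}_{i_p, i_q}$.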

The main difference between FMs and MVMs is that the interaction
parameters in different orders are completely independent in
FMs, e.g., the first-order interaction $w^{(v)}_{i_v}$ and the
second-order interaction $w^{(p,q)}_{i_p, i_q}$ in Eq. (27). On
the contrary, in MVMs, all the orders of interactions share the
same set of latent factors, e.g., $\mathbf{a}^{(v)}_{i_v}$ in
Eq. (6). For example, the combination of
$\mathbf{a}^{(v)}_{i_v}$ and the bias factors from the other
$m - 1$ views, i.e.,
$\sum_{f=1}^{k} a^{(v)}_{i_v, f} \prod_{v' \ne v} a^{(v')}_{I_{v'}+1, f}$,
approximates the first-order interaction $w^{(v)}_{i_v}$.
Similarly, we can obtain the second-order interaction
$w^{(p,q)}_{i_p, i_q}$ by combining $\mathbf{a}^{(p)}_{i_p}$,
$\mathbf{a}^{(q)}_{i_q}$ and the other $m - 2$ bias factors.
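The way lower order weights emerge from the shared factors can be illustrated with a small two-view example. The factor values below are made up for illustration, indices are 0-based, and the helper name `implied_weight` is an assumption of this sketch.

```python
import math

def implied_weight(A, idx):
    """w_{i_1..i_m} = sum_f prod_v a^(v)_{i_v,f}, the element-wise CP form."""
    k = len(A[0][0])
    return sum(math.prod(A[v][i][f] for v, i in enumerate(idx)) for f in range(k))

# Two views with I_1 = 2 and I_2 = 1 features and k = 2 factors.
# The last row of each matrix is the bias factor of that view.
A = [
    [[1.0, 0.0], [0.5, 2.0], [1.0, 1.0]],  # A^(1): feature 1, feature 2, bias
    [[2.0, -1.0], [0.5, 1.0]],             # A^(2): feature 1, bias
]
BIAS = [2, 1]  # 0-based position of the bias row in each view

w0 = implied_weight(A, (BIAS[0], BIAS[1]))   # global bias, Eq. (8)
w1_first = implied_weight(A, (0, BIAS[1]))   # first-order weight: view 1, feature 1
w12 = implied_weight(A, (1, 0))              # second-order weight: view 1 feature 2 with view 2 feature 1

print(w0, w1_first, w12)
```

Changing a single row of `A` alters every weight that involves the corresponding feature, which is the coupling between orders that separately parameterized models lack.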

Such difference is more significant for higher order FMs, as
shown in Figure 3. Assuming the same number of factors in
different orders of interactions, the number of parameters to be
estimated in an $m$th-order FM is
$1 + (1 + (m - 1)k) \sum_{v=1}^{m} I_v = (k(m - 1) + 1)d + 1$,
which can be much larger than $k(m + d)$ in MVMs when there are
many views (i.e., a large $m$). Therefore, compared to MVMs, FMs
are not fully factorized.
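For a sense of scale, the parameter counts quoted in the text can be evaluated for a hypothetical configuration. The view sizes and the number of factors below are arbitrary choices for illustration.

```python
import math

def param_counts(dims, k):
    """Return (full tensor, MVM, m-th order FM) parameter counts."""
    m, d = len(dims), sum(dims)
    full = math.prod(I + 1 for I in dims)  # unfactorized tensor in Eq. (4)
    mvm = k * (m + d)                      # factor matrices A^(1..m)
    fm = (k * (m - 1) + 1) * d + 1         # m-th order FM, as stated in the text
    return full, mvm, fm

print(param_counts([10, 20, 30], k=5))  # (7161, 315, 661)
```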

6. CONCLUSION 

In this paper, we have proposed a multi-view machine (MVM) model
and presented an efficient learning method based on stochastic
gradient descent. In general, MVMs can be applied to a variety
of supervised machine learning tasks, including classification
and regression, and are particularly designed for data that is
composed of features from multiple views, between which the
interactions are effectively explored. In contrast to other
models that explore only partial interactions or factorize the
interactions in different orders separately, MVMs jointly
factorize the full-order interactions, thereby benefiting
parameter estimation under sparsity.